Suppose we have a report and we want to find the sentences that are talking about numerical things....
Originally inspired by "When you get data in sentences: how to use a spreadsheet to extract numbers from phrases" by Paul Bradshaw on the Online Journalism blog, from which some of the example sentences (sic!) are taken.
Distribution: https://twitter.com/paulbradshaw/status/1158752556958519297
quantulum: extract quantities from natural language text;
ctparse: extract time / date related quantities from natural language text;
r1chardj0n3s/parse: easy scrape / regex extraction from semi-structured text using format() like patterns; example use here;
dateparser [docs]: "easily parse localized dates in almost any string formats commonly found on web pages" (includes foreign language detection);
In [152]:
sentences = [
    '4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months',
    'No quantities here',
    'I measured it as 2 meters and 30 centimeters.',
    "four years and six months' imprisonment with a licence extension of 2 years and 6 months",
    'it cost £250... bargain...',
    'it weighs four hundred kilograms.',
    'It weighs 400kg.',
    'three million, two hundred & forty, you say?',
    'it weighs four hundred and twenty kilograms.'
]
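As a rough baseline before reaching for any of the packages above, a quick stdlib regex sketch can already pull out simple digit-based amounts from these sentences. It deliberately won't catch spelled-out numbers like "four hundred" — that's what the libraries are for.

```python
import re

# Match an optional £ sign, an integer or decimal, then an optional
# unit-ish word (e.g. '400kg', '2 meters', '£250').
# A crude baseline only: it misses spelled-out numbers.
AMOUNT = re.compile(r'£?\d+(?:\.\d+)?\s*[a-zA-Z]*')

def rough_amounts(sent):
    """Return all digit-based amount-like substrings in a sentence."""
    return AMOUNT.findall(sent)

rough_amounts('It weighs 400kg.')  # ['400kg']
```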
quantulum3
quantulum3 is a Python package "for information extraction of quantities from unstructured text".
In [153]:
#!pip3 install quantulum3
from quantulum3 import parser
In [154]:
for sent in sentences:
    print(sent)
    p = parser.parse(sent)
    if p:
        print('\tSpoken:', parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')
In [155]:
import spacy
nlp = spacy.load('en_core_web_lg', disable = ['ner'])
In [171]:
text = '''
Once upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250.
It was blue. It took forty five minutes to get it home.
What a day that was. I didn't get back until 2.15pm. Then I had some cake for tea.
'''
In [172]:
doc = nlp(text)
for sent in doc.sents:
    print(sent)
In [173]:
for sent in doc.sents:
    sent = sent.text
    p = parser.parse(sent)
    if p:
        print('\tSpoken:', parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')
In [1]:
url = 'https://raw.githubusercontent.com/BBC-Data-Unit/unduly-lenient-sentences/master/ULS%20for%20Sankey.csv'
In [2]:
import pandas as pd
df = pd.read_csv(url)
df.head()
Out[2]:
In [178]:
#get a row
df.iloc[1]
Out[178]:
In [179]:
#and a, erm, sentence...
df.iloc[1]['Original sentence (refined)']
Out[179]:
In [180]:
parser.parse(df.iloc[1]['Original sentence (refined)'])
Out[180]:
In [206]:
def amountify(txt):
    #txt may be some flavour of nan...
    #handle scruffily for now...
    try:
        if txt:
            p = parser.parse(txt)
            x = []
            for q in p:
                x.append('{} {}'.format(q.value, q.unit.name))
            return '::'.join(x)
        return ''
    except:
        return
In [207]:
df['amounts'] = df['Original sentence (refined)'].apply(amountify)
In [208]:
df.head()
Out[208]:
We could then do something to split multiple amounts into multiple rows or columns...
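One way to do that split, as a sketch: break the `::`-joined string back into a list and use pandas' `explode()` to give each amount its own row. The mini-frame below is a hypothetical stand-in for the `amounts` column built above.

```python
import pandas as pd

# Hypothetical stand-in for the df['amounts'] column built by amountify()
tmp = pd.DataFrame({'amounts': ['4.0 year::6.0 month', '', '400.0 kilogram']})

# One amount per row: split on the '::' separator, then explode the lists
long_form = (tmp.assign(amount=tmp['amounts'].str.split('::'))
                .explode('amount'))
```

A pivot on the exploded frame would get multiple columns per row instead, if that shape is more useful downstream.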
The sentencing sentences look to have a reasonable degree of structure to them (or at least, there are some commonalities in the way some of them are structured).
We can exploit this structure by writing some more specific pattern matches to pull out even more information.
In [6]:
df['Original sentence (refined)'][:20].apply(print);
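For instance, one of the recurring shapes is "N years and N months' imprisonment". A sketch of a more specific pattern match for that shape (the function name and the conversion to months are my own additions, and it handles both straight and curly apostrophes):

```python
import re

# "4 years and 6 months' imprisonment" -> total term in months.
# Handles straight (') and curly (’) apostrophes; a sketch, not a
# complete grammar for these sentences.
TERM = re.compile(r"(\d+) years? and (\d+) months?['’] imprisonment")

def term_in_months(sent):
    """Return the custodial term in months, or None if no match."""
    m = TERM.search(sent)
    if m:
        return int(m.group(1)) * 12 + int(m.group(2))
    return None

term_in_months("4 years and 6 months’ imprisonment "
               "with a licence extension of 2 years and 6 months")  # 54
```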
It makes sense to try to build a default hierarchy that extracts from more specific to less specific structures...
For example:
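One way to frame such a hierarchy, as a sketch: hold an ordered list of (pattern, label) pairs, most specific first, and return the label of the first pattern that matches. The patterns below are illustrative placeholders, not rules derived from the actual data.

```python
import re

# Most specific first; the first pattern that matches wins.
# These patterns are illustrative placeholders only.
PATTERNS = [
    (re.compile(r"\d+ years? and \d+ months['’] imprisonment"), 'term_years_months'),
    (re.compile(r"\d+ years['’]? imprisonment"), 'term_years'),
    (re.compile(r"\d+"), 'bare_number'),
]

def classify(sent):
    """Label a sentence with the most specific matching pattern."""
    for pattern, label in PATTERNS:
        if pattern.search(sent):
            return label
    return 'no_match'
```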